library(tidyverse)
library(maptools)
library(sp)
library(rgdal)
library(choroplethrMaps)
library(choroplethr)
library(ggridges)
library(forcats)
library(GGally)
library(gridExtra)
library(extracat)
library(plotly)

I. Introduction

II. Description of the data source

We used Community Health Status Indicators (CHSI) dataset, which was published by Centers for Disease Control and Prevention on data.gov website. It was created on June 19, 2015, and updated on February 26, 2019. There were 9 tables and contained over 200 measures for each of the 3,141 United States counties. We mainly used four tables in this file.

Tables Description

Table1 (Demographics): “demographics.csv”

This table provided an overall characteristic of each county. The data were obtained from the Current Population Survey (CPS) conducted by the U.S. Bureau of the Census. The CPS is an ongoing survey of states from which estimates for counties are derived.

Table2 (Health Record): “summary measures of health.csv”

This table contained broad measures of health. Each measure captures a single, comprehensive measure of population-based health. Those measures were generated from different sources. Average Life Expectancy was calculated by Chris Murray and colleagues at the Harvard School of Public Health. All Causes of Death came from the National Vital Statistics System, National Center for Health Statistics. Health Status and Average Number of Unhealthy Days are provided by the Behavioral Risk Factor Surveillance Systemctors, a survey conducted jointly by states and the Centers for Disease Control and Prevention.

Table3 (Risk Factors): “risk factors and access to care.csv”

This table provides information on the prevalence of adult risk characteristics associated with the leading causes of death. It was conducted by The Behavioral Risk Factor Surveillance System (BRFSS), a survey conducted jointly by states and the Centers for Disease Control and Prevention.

Table4 (Diseases): “measures of birth and death.csv”

This table contained death rate due to certain diseases. We mainly used the mortality data, and they are coming from the National Center for Health Statistics, National Vital Statistics System.

Variables We Used:

Table1 (Demographics): Population Size(Population_Size); Poverty level(Poverty).
Table2 (Health Record): Average Life Expectancy(ALE); All Causes of Death(All_Death); Self-rated Health Status(Health_Status); Average Number of Unhealthy Days in Past Month(Unhealthy_Days).
Table3 (Risk Factors): No exercise(No_Exercise); Few fruits/vegetables(Few_Fruit_Veg); Obesity(Obesity); High Blood Pressure(High_Blood_Pres); Smoker(Smoker); Diabetes(Diabetes).
Table4 (Diseases): Breast Cancer (Female)(Brst_Cancer); Colon Cancer(Col_Cancer); Coronary Heart Disease(CHD); Lung Cancer(Lung_Cancer); Stroke(Stroke).

Known Issues

Some data were estimated using small sample size, so the data were not really accurate. The confidence interval for these measures may be wide. In addition, due to huge numbers of county and large numbers of variables, there were many missing values (no data available / no report).

III. Description of data import / cleaning / transformation

## Data
Demo<-read.csv("demographics.csv")
Health.Summary<-read.csv("summary measures of health.csv")
Risk<-read.csv("risk factors and access to care.csv")
Birth.Death<-read.csv("measures of birth and death.csv")


Health<-cbind(Demo[,c(1:6,9,12,15,30,33,36,39,42)], Health.Summary[,c(7,11,17,23)],Risk[,c(7,10,13,16,19,22)],Birth.Death[,c(85,91,97,109,121)])

## Identify NA
for (i in 7:29){
  Health[,i]<-ifelse(Health[,i]<0,NA,Health[,i])
}


## Health by State
Health.State<-Health %>% 
  group_by(CHSI_State_Name) %>% 
  summarize(State_FIPS_Code=mean(State_FIPS_Code),
            Population.Size=sum(Population_Size),
            Poverty=sum(Population_Size*0.01*Poverty, na.rm=TRUE)/sum(Population_Size,na.rm = TRUE)*100, 
            Life_exp=mean(ALE, na.rm=TRUE),
            All_Death=sum(All_Death, na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Health_Status=sum(Population_Size*0.01*Health_Status,na.rm=TRUE)/sum(Population_Size)*100,
            Unhealthy_Days=mean(Unhealthy_Days,na.rm=TRUE),
            No_Exercise=sum(Population_Size*0.01*No_Exercise, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Few_Fruit_Veg=sum(Population_Size*0.01*Few_Fruit_Veg, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Obese=sum(Population_Size*0.01*Obesity, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            High_Blood_Pres=sum(Population_Size*0.01*High_Blood_Pres, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Smoker=sum(Population_Size*0.01*Smoker, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Diabetes=sum(Population_Size*0.01*Diabetes, na.rm=TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Brst_Cancer=sum(Brst_Cancer,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Col_Cancer=sum(Col_Cancer,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            CHD=sum(CHD,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Lung_Cancer=sum(Lung_Cancer,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100,
            Stroke=sum(Stroke,na.rm = TRUE)/sum(Population_Size, na.rm = TRUE)*100)
Health.State[,c(4:19)]<-round(Health.State[,c(4:19)],2)
Health.State$CHSI_State_Abbr<-names(table(Health$CHSI_State_Abbr))
Health.State[Health.State$CHSI_State_Name=="Alaska",]<-ifelse(Health.State[Health.State$CHSI_State_Name=="Alaska",]==0, NA, Health.State[Health.State$CHSI_State_Name=="Alaska",])
write.csv(Health.State,"Health.State.csv")
Using “Health.csv”, we developed another table “Health.State.csv,”which contains the information for each state. Within each state, we calculated the average rate for each variable. To calculate the average rate, we first used population size of each county times the rate (of each county) to get the number of people for each variable, then using the number of people (of each state) of each variable divided by the state’s population to get the state average of each variable. In this table, only Alaska had several missing values.

IV. Analysis of missing values

a. Missing Values from ‘Health’ Dataset

visna(Health, sort = "b")

In “Health.csv”, there were many missing values, since there are 3141 counties and it is hard to collect all the data. When we look vertically, among those variables, table3–risk factors is the one that had the highest number of missing values. In particular, around 50% of counties did not have High Blood Pressure(High_Blood_Pres) record. If we looked horizontally, we can tell that most counties had complete record.

b. Missing Values from ‘Health.State’ Dataset

visna(Health.State, sort = "b")

In “Health.State.csv”, only one state (Alaska) had missing values. It missed eight varibales, and most of them are risk factors.

V. Results

I. Relationship Between all Variables

a. Correlation Plot of All Variables

## correlation grid
ggcorr(Health.State[,-c(1:2,20)], label = TRUE, label_round = 2, hjust = 0.75, size = 3, label_size = 2, title="Correlation Plot")+labs(title = "Correlation Plot")

This plot shows the correlation between each pair of variables. Red represents positive correlation and blue means negative correlation. The lighter color means a weaker relationship. From the above grid, here are some of our key findings.
1. Poverty rate has strong correlation with life expectancy, Health status, and some risk factors (No exercise, Obesity, Diabetes). In particular, higher rate of poverty may imply shorter life expectancy, higher rate of poor health status, more unhealthy days in the past month, and higher rate of risk factors.
2. Health status are highly correlated with risk factors. The states with high rate of risk factors tends to have higher unhealthy rate.
4. To our suprise, life expectancy has high correlation with risk factors, but not diseases.
5. High blood pressure has negative correlation with the listed diseases (Breast Cancer (Female), Colon Cancer, Coronary Heart Disease, Lung Cancer, Stroke). Few fruits/vegetables, Obese and Smoker has moderate correlation with Diseases.
6. Most risk factors are correlated with each other. One exception is that Few fruits/vegetables has less correlation Diabetes.
7. Diseases are highly correlated. It means the state with high rate of one disease is likely to have high rate of other diseases also. me direction of arrow.

b. Scatter Plot of Risk Factors and Diseases Death

tidydf<-gather(Health.State, key = "Diseases", value = "Percentage", colnames(Health.State%>%select(15:19)))

## Scatter plot (no exercise)
sc1<-ggplot(data = tidydf, mapping = aes(x = No_Exercise, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)

## Scatter plot (few fruit and veg)
sc2<-ggplot(data = tidydf, mapping = aes(x = Few_Fruit_Veg, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)

## Scatter plot (Obesity)
sc3<-ggplot(data = tidydf, mapping = aes(x = Obese, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)

## Scatter plot (High blood pre)
sc4<-ggplot(data = tidydf, mapping = aes(x = High_Blood_Pres, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)

## Scatter plot (Smoker)
sc5<-ggplot(data = tidydf, mapping = aes(x = Smoker, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)

## Scatter plot (Diabetes)
sc6<-ggplot(data = tidydf, mapping = aes(x =Diabetes, y = Percentage)) +
  geom_point(aes(color = Diseases),size=0.7,alpha=0.7)

grid.arrange(sc1, sc2,sc3,sc4,sc5,sc6, nrow = 3)

From this set of plot, we can find that there is no obvious relationship between risk factors and diseases. But we can still find some tiny influence of risk factors on diseases. For example, we can find that when the percentage of few fruits and vegetables and the percentage of smoker go up, there are more percentage of diseases leading to death.

II. Risk Factors Analysis of the United States

a.Distribution of Risk Factor

### boxplot
New<-gather(Health, key="Categories", value = "Percentage", colnames(Health%>%select(19:24)))
New<-na.omit(New[,c(24,25)])
ggplot(New) + geom_boxplot(aes(y=Percentage,x=reorder(Categories, Percentage, median)))+
  ylab("Percentage")+xlab("Risk Factors")+ggtitle("Boxplot of Risk Factors")+
  coord_flip()

## ridge
ggplot(New)+
  geom_density_ridges(aes(x=New$Percentage, y=reorder(New$Categories, New$Percentage, median)))+xlab("Percentage")+ylab("Risk Factors")+ggtitle("Ridgeline Plot of Risk Factors")

Both boxplot and ridgeline plot show the distribution of risk factors, taking counties as observations.
1. Few fruits/vegetables is a serious problem in the United States. The overall rate is higher compare to other risk factors. Most counties have high rate of Few fruits/vegetables.
2. Lower rate of people have diabetes. In addition, the distribution has higher peak and small spread, meaning that the rate of diabetes among all counties are pretty similar.

b.Relationship between Risk Factors and all States (PCA)

## PCA biplot
HS.pc<- as.data.frame(Health.State[,c(9:14)])
rownames(HS.pc) <- Health.State$CHSI_State_Name
pr.out<-prcomp(na.omit(HS.pc),scale=TRUE)
biplot(pr.out, expand=1.3, cex=c(0.6,0.6), main="Biplot of risk factors (Scaled)", xlim=c(-0.4, 0.4),
 ylim=c(-0.5,0.5), col=c("orange","blue"))

The biplot is an enhanced scatterplot that uses both points and vectors to represent structure. Samples are displayed as points while variables are displayed as vectors. It applies principal components analysis, which helps us display high dimensional data with lower dimensions (in this case, two dimensions), without losing key characteristics. The axes of a biplot are a pair of principal components. If two points are close to each other, meaning they are alike in risk factors. The arrows represent the variables. Along the direction means higher rate.
1. Most arrows point toward the same direction, meaning that those variables are highly correlated, besides the factor Few fruits/vegetables.
2. The states on the opposite direction of the arrow, like Colorado and Minnesota, has lower rate of risk factors.
3. West Virginia has really high risk factor rates since it locates in the same direction of arrow.

III. Relationship Between Risk Factors and Health Information

– For Example: no exercise percentage and Average Life Expectancy (ALE)

a.Map of no exercise and Map of ALE

#average life expectancy ALE
countycode<-substr(paste("00",Health[,2],sep=''),nchar(paste("00",Health[,2],sep=''))-2,nchar(paste("00",Health[,2],sep='')))
code<-as.numeric(paste(Health[,1],countycode,sep=''))

county.ALE<-data.frame(region=as.numeric(paste(Health[,1],countycode,sep='')),value=Health[,15])
county_choropleth(county.ALE, 
                  title  = "County Data, Average Life Expectancy (ALE)", 
                  legend = "ALE")

county.No_Exercise<-cbind.data.frame(region=code,value=Health[,19])
county_choropleth(county.No_Exercise, 
                  title  = "County Data, No_Exercise", 
                  legend = "No_Exercise")

From the map, the lighter color represent a lower value, and darker color represent a higher value. Black indicated missing value.
1. The first map shows a shorter life expectancy in the southeast of the United States. In the middle north, there is longer life expectancy.
2. The second map displays a lower rate of no exercise in the middle north, and southeast has higher rate of no exercise.
Thus, we may conclude that the rate of no exercise has some correlation with the Average life expectancy.

b. Heatmap of no exercise V.S. ALE

p1<-plot_ly(x=Health$ALE,
        y=Health$No_Exercise
          )
p1 %>% add_histogram2d(nbinsx=30, nbinsy=30)
From this heatmap, x-axis is the life expectancy, and y-axis is the rate of no exercise. We can roughly see a decreasing pattern. which means they are negatively correlated.
p2<-plot_ly(x=Health$ALE,
        y=Health$Few_Fruit_Veg
          )
p2 %>% add_histogram2d(nbinsx=30, nbinsy=30)
From this heatmap, x-axis is the life expectancy, and y-axis is the rate of Few_Fruit_Veg. We can observe that there doesn’t exist obvious linear relationship between these two variables, but we can know that relatively higher percentage of Few_Fruit_Veg will cause a shorter life expectancy. Especially, when the percentage is around 77%, the life expectancy is aroud 77.
* More relationship between risk factors and health information can be explored by using Shiny App

IV. Disease Analysis of the United States

a.States Level Diseases Death Analysis (Clevland Dot Plot)

## clevland

ggplot(tidydf, aes(Percentage, fct_reorder2(`CHSI_State_Name`, Diseases=="CHD", Percentage, .desc = FALSE), color = Diseases)) +
  geom_point() +
  ggtitle("Diseases sorted by Coronary Heart Disease Percentage") + ylab("") 

From the Clevland Dot Plot ordered by Coronary Heart Disease (CHD), we can observe all States’ five kind of diseases death percentage. From this ordered dot plot we find three distinctive groups of all five diseases. Breast cancer (female) and the colon cancer are in the first group in the lowest level of percentage to cause death. Lung cancer and stroke are in the second group with the medium level of percentage to cause death and we can still observe that stroke death is slightly higher than Lung cancer. Coronary Heart Disease (CHD) is the highest disease that cause death for all States and the number of CHD’s percentage is usually far higher than other causes.
So, from the dot plot we can conclude that Coronary Heart Disease is the most serious problem from the respect of diseases causing death.

b.Diseases Death Analysis based on Maps

     -- For Example: Coronary Heart Disease (CHD) County Level Map
county.CHD<-cbind.data.frame(region=code,value=Health[,27]/Health[,7]*100)
county_choropleth(county.CHD, 
                  title  = "County Data, CHD", 
                  legend = "CHD")
## Warning in super$initialize(map.df, user.df): Your data.frame contains the
## following regions which are not mappable: 2201, 2232, 2280
## Warning in self$bind(): The following regions were missing and are being
## set to NA: 31075, 31115, 2105, 49009, 48033, 48301, 2198, 15005, 31005,
## 48269, 2060, 2282, 48261, 8014, 30069, 31117, 8053, 8111, 2195, 2230, 2068,
## 2013, 2275, 16025

From the above conclusion, we know that Coronary Heart Disease is the most serious problem, so in this part, we will go further and study the distribution of Coronary Heart Disease from County level based on maps.
From the map, we can observe that Coronary Heart Disease is more serious in the middle of United States because it is obviously that in the middle of the map the color is deeper.
The southwest and east coast have lower percentage of Coronary Heart Disease problem.
The northwest and the southeast have higher percentage of Coronary Heart Disease problem when compared with the situation of The southwest and east coast.
* More Information about Diseases Death can be explored by using Shiny App

VI. Interactive component

In the interactive plot, we can go further to study more relationship bewteen all variables and also can get more information about the every vable’s distribution of geographical location.

VII. Conclusion

The limitation of this study is that there are so many estimated values and missing values. If we can overcome those problems, we can get better results.
In the future, we want to study the other variable in the dataset. In particular, we could add gender and races in our analysis and see what will happen.
The lesson people can take from this report is that in order to have a longer life expectancy, we need to have a healthier lifestyle. Doing more exercises, eating more fruit and vegetables, smoke less, and having less sugar are all good ways. We also know that Coronary Heart Disease is the most series problem in the United States.